The Natality.csv contains information of Natality in each state. This includes the baby’s gender, birth weight, mother’s age and total number of births. The Census2015.cvs file contains the median age of the total population in each state.
Below is our sessionInfo().
sessionInfo(package=NULL)
## R version 3.3.2 (2016-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 14393)
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] backports_1.0.5 magrittr_1.5 rprojroot_1.2 tools_3.3.2
## [5] htmltools_0.3.5 yaml_2.1.14 Rcpp_0.12.10 stringi_1.1.2
## [9] rmarkdown_1.3 knitr_1.15.1 stringr_1.1.0 digest_0.6.11
## [13] evaluate_0.10
source("../01 Data/ETL.R")
## Loading required package: readr
## Warning: package 'readr' was built under R version 3.3.3
## Loading required package: plyr
## Parsed with column specification:
## cols(
## State = col_character(),
## State_Code = col_integer(),
## Gender = col_character(),
## Gender_Code = col_character(),
## Race = col_character(),
## Births = col_integer(),
## Average_Birth_Weight = col_double(),
## Average_Age_Mother = col_double()
## )
## Classes 'tbl_df', 'tbl' and 'data.frame': 404 obs. of 8 variables:
## $ State : chr "Alabama" "Alabama" "Alabama" "Alabama" ...
## $ State_Code : int 1 1 1 1 1 1 1 1 2 2 ...
## $ Gender : chr "Female" "Female" "Female" "Female" ...
## $ Gender_Code : chr "F" "F" "F" "F" ...
## $ Race : chr "American Indian or Alaska Native" "Asian or Pacific Islander" "Black or African American" "White" ...
## $ Births : int 100 576 9116 19514 90 617 9313 20331 1179 565 ...
## $ Average_Birth_Weight: num 3202 3112 2947 3228 3317 ...
## $ Average_Age_Mother : num 27 30.6 26 27.4 26.5 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 8
## .. ..$ State : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ State_Code : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Gender : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Gender_Code : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Race : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Births : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Average_Birth_Weight: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Average_Age_Mother : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
## Classes 'tbl_df', 'tbl' and 'data.frame': 404 obs. of 8 variables:
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ State_Code : Factor w/ 51 levels "1","10","11",..: 1 1 1 1 1 1 1 1 11 11 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 2 2 1 1 ...
## $ Gender_Code : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 1 1 ...
## $ Race : Factor w/ 4 levels "American Indian or Alaska Native",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Births : Factor w/ 375 levels "100","10091",..: 1 306 362 121 361 317 368 127 31 302 ...
## $ Average_Birth_Weight: Factor w/ 399 levels "2876.45","2926.51",..: 160 59 3 190 306 187 37 326 339 196 ...
## $ Average_Age_Mother : Factor w/ 280 levels "24.82","25.28",..: 76 243 25 99 51 241 26 97 39 137 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 8
## .. ..$ State : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ State_Code : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Gender : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Gender_Code : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Race : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Births : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Average_Birth_Weight: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Average_Age_Mother : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = ""): invalid factor
## level, NA generated
## [1] "State_Code"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "Births"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "Average_Birth_Weight"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## [1] "Average_Age_Mother"
## Warning in `[<-.factor`(`*tmp*`, is.na(x), value = 0): invalid factor
## level, NA generated
## Classes 'tbl_df', 'tbl' and 'data.frame': 404 obs. of 8 variables:
## $ State : Factor w/ 51 levels "Alabama","Alaska",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ State_Code : num 1 1 1 1 1 1 1 1 2 2 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 2 2 2 1 1 ...
## $ Gender_Code : Factor w/ 2 levels "F","M": 1 1 1 1 2 2 2 2 1 1 ...
## $ Race : Factor w/ 4 levels "American Indian or Alaska Native",..: 1 2 3 4 1 2 3 4 1 2 ...
## $ Births : num 100 576 9116 19514 90 ...
## $ Average_Birth_Weight: num 3202 3112 2947 3228 3317 ...
## $ Average_Age_Mother : num 27 30.6 26 27.4 26.5 ...
## - attr(*, "spec")=List of 2
## ..$ cols :List of 8
## .. ..$ State : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ State_Code : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Gender : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Gender_Code : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Race : list()
## .. .. ..- attr(*, "class")= chr "collector_character" "collector"
## .. ..$ Births : list()
## .. .. ..- attr(*, "class")= chr "collector_integer" "collector"
## .. ..$ Average_Birth_Weight: list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## .. ..$ Average_Age_Mother : list()
## .. .. ..- attr(*, "class")= chr "collector_double" "collector"
## ..$ default: list()
## .. ..- attr(*, "class")= chr "collector_guess" "collector"
## ..- attr(*, "class")= chr "col_spec"
## CREATE TABLE 01_DataNatality (
## Natality
## State varchar2(4000),
## Gender varchar2(4000),
## Gender_Code varchar2(4000),
## Race varchar2(4000),
## State_Code number(38,4),
## Births number(38,4),
## Average_Birth_Weight number(38,4),
## Average_Age_Mother number(38,4)
## );
require(readr)
require(plyr)
file_path = "../01 Data/Natality.csv"
natality <- readr::read_csv(file_path)
names(natality)
df <- natality
names(df)
str(df) # Uncomment this line and run just the lines to here to get column types to use for getting the list of measures.
measures <- c("State_Code","Births","Average_Birth_Weight","Average_Age_Mother")
dimensions <- setdiff(names(df), measures)
dimensions
# Get rid of special characters in each column.
# Google ASCII Table to understand the following:
for(n in names(df)) {
df[n] <- data.frame(lapply(df[n], gsub, pattern="[^ -~]",replacement= ""))
}
str(df)
na2emptyString <- function (x) {
x[is.na(x)] <- ""
return(x)
}
if( length(dimensions) > 0) {
for(d in dimensions) {
# Change NA to the empty string.
df[d] <- data.frame(lapply(df[d], na2emptyString))
# Get rid of " and ' in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="[\"']",replacement= ""))
# Change & to and in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern="&",replacement= " and "))
# Change : to ; in dimensions.
df[d] <- data.frame(lapply(df[d], gsub, pattern=":",replacement= ";"))
}
}
na2zero <- function (x) {
x[is.na(x)] <- 0
return(x)
}
# Get rid of all characters in measures except for numbers, the - sign, and period.dimensions, and change NA to 0.
if( length(measures) > 1) {
for(m in measures) {
print(m)
df[m] <- data.frame(lapply(df[m], gsub, pattern="[^--.0-9]",replacement= ""))
df[m] <- data.frame(lapply(df[m], na2zero))
df[m] <- lapply(df[m], function(x) as.numeric(as.character(x)))
#df[m] <- data.frame(lapply(df[m], as.numeric(as.character))) # This is needed to turn measures back to numeric because gsub turns them into strings.
}
}
str(df)
write.csv(df, gsub("Natality", "Natality", file_path), row.names=FALSE, na = "")
tableName <- gsub(" +", "_", gsub("[^A-z, 0-9, ]", "", gsub(".csv", "", file_path)))
sql <- paste("CREATE TABLE", tableName, "(\n Natality \n")
if( length(measures) > 0 || ! is.na(dimensions)) {
for(d in dimensions) {
sql <- paste(sql, paste(d, "varchar2(4000),\n"))
}
}
if( length(measures) > 0 || ! is.na(measures)) {
for(m in measures) {
if(m != tail(measures, n=1)) sql <- paste(sql, paste(m, "number(38,4),\n"))
else sql <- paste(sql, paste(m, "number(38,4)\n"))
}
}
sql <- paste(sql, ");")
cat(sql)
The cleaned up data file, Natality.csv, is on conneyc’s account under dataset S17 DV Project 5. The TAs have been added as contributors and can download the file from the link provided below.
The data file containing the census data we are using, Census2015.csv, is on conneyc’s account under dataset S17 DV Project 5. The TAs have been added as contributors and can download the file from the link provided below.
This visualization is a boxplot of state vs births. It is filtered by Race and Births and has the detail of the gender of the births. It is observed that California has the largest number of births for female and males followed by Texas.
This is the same data visualization done in Shiny.
This Treemap displays the count of states that fall in each average age for mothers (in 0.5 bins), by race. The graph reveals the most common to least common age for giving birth in each race. Boxes are colored by the count, with the race, and average age, and count number as text. For Asians, it is seen that most mothers are 30.5 years old, white mothers are 28, African Americas are 26, and Natives are 26.5. Asian mothers have babies the latest while African Americans have them the earliest.
This faceted Shiny histogram is similar to the Tableau treeplot, since it shows the most common average mother age for each race. However, it also includes a red average line for each race, so you can easily see that the average age of mother is highest for Asian mothers and lowest for African Americans.
This scatterplot compares the median age and number of births of each state. Each point signifies the median age of the state and number of births of each gender for a specific race within that state. The shape of the points classify the race and the color classifies the state. A trend line was added which shows Births = -682.411*Median_Age + 35676.1, meaning as the median age of the state increases, there appears to be a decrease in the number of births.
This is the same visualization done in Shiny.
This graph shows average birth weight per state by race. It is colored by the calculated field (KPI): AVG(average birth weight)/4000. The number 4000 is an average healthy birth weight in the US. The data reflects that white babies tend to weigh more than the other races.
This is the same visualization done in Shiny.
This graph was created by joining Natality.csv with Census2015.csv through State= AreaName. It shows the total number of births for each race per state using the data from Natality.csv. The number of births for corresponding states is colored by the Median_Age column from Census2015.csv using parameters. The parameters are low: 0-34.7, medium: 34.7-39.3, and high: 39.3+. Regardless of median age, Whites have the largest number of births followed by Black or African, Asian or Pacific Islander then American Indian or Alaska Native. However, this trend is most likely a reflection of the existing demographic where the majority of the population is White. Texas is a low Median age state and has a high number of births.
This is the same visualization done in Shiny.
This graph shows average birth weight (in grams) for each race in each state. There is a reference line that is the average birth weight for the state, and the calculated difference from average is applied. The calculated difference is the average birth weight (for that race in that state) minus the total average birth weight of all the states. It is used for color of the bars and text and displayed by the mark labels. Numbers are shown when the mouse is hovered over each bar. The data reflects that White babies and American Indian/Alaskan babies tend to weigh more than the average baby for each state.
This is the Shiny version. The states chosen are states in the deep south, and the trend of White and Native babies weighing more continues for these states. The average line shows the average birth weight for the state, and the calculated difference is the difference between the average birth weight in the state and the average birth weight for the specific race in the state.
We created a set called high median age, which are the states with a median age of over 40. We created another set called low median age which are states with a median age of 35.8 and under. This graph mapped those two sets of states. We observed that the high median age states are all in the east coast while the low median age states tend to be in the central and western US.
This graph used data from Census2015 (Median Age) and Natality (Births, Race) joined by State. It shows the sum of births in the 7 Highest and 7 Lowest Median Age States. The red bars are the low median age states while blue are the high ones. Each bar is labeled with their median age. Maine has the highest median age at 43.80 while Utah has the lowest at 30.10 years old. There are low and high birth numbers in both the high and low median age states, so there isn’t an apparent direct correlation between median age and the number of births. However, it is interesting that California and Texas as low median age states have the highest birth numbers; in total the low median age states have a larger sum of births than the high median age states.
The Shiny version is similar for this visualization.
Our visualizations are published in an application on this account: